{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# 基于字典的情感分析\n",
    "\n",
    "以下内容来自**邓旭东HIT** https://zhuanlan.zhihu.com/p/23225934\n",
    "\n",
    "情感分析就是分析一句话说得是很主观还是客观描述，分析这句话表达的是积极的情绪还是消极的情绪。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## 原理\n",
    "比如这么一句话：\n",
    "\n",
    "> “这手机的画面极好，操作也比较流畅。不过拍照真的太烂了！系统也不好。”\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "① 情感词\n",
    "\n",
    "要分析一句话是积极的还是消极的，最简单最基础的方法就是找出句子里面的情感词，积极的情感词比如：赞，好，顺手，华丽等，消极情感词比如：差，烂，坏，坑爹等。出现一个积极词就+1，出现一个消极词就-1。\n",
    "里面就有“好”，“流畅”两个积极情感词，“烂”一个消极情感词。那它的情感分值就是1+1-1+1=2. 很明显这个分值是不合理的，下面一步步修改它。\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "\n",
    "\n",
    "② 程度词\n",
    "\n",
    "“好”，“流畅”和‘烂“前面都有一个程度修饰词。”极好“就比”较好“或者”好“的情感更强，”太烂“也比”有点烂“情感强得多。所以需要在找到情感词后往前找一下有没有程度修饰，并给不同的程度一个权值。比如”极“，”无比“，”太“就要把情感分值*4，”较“，”还算“就情感分值*2，”只算“，”仅仅“这些就*0.5了。那么这句话的情感分值就是：4*1+1*2-1*4+1=3\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "\n",
    "\n",
    "③ 感叹号\n",
    "\n",
    "可以发现太烂了后面有感叹号，叹号意味着情感强烈。因此发现叹号可以为情感值+2. 那么这句话的情感分值就变成了：4*1+1*2-1*4-2+1 = 1\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "\n",
    "\n",
    "④ 否定词\n",
    "\n",
    "明眼人一眼就看出最后面那个”好“并不是表示”好“，因为前面还有一个”不“字。所以在找到情感词的时候，需要往前找否定词。比如”不“，”不能“这些词。而且还要数这些否定词出现的次数，如果是单数，情感分值就*-1，但如果是偶数，那情感就没有反转，还是*1。在这句话里面，可以看出”好“前面只有一个”不“，所以”好“的情感值应该反转，*-1。\n",
    "因此这句话的准确情感分值是：4*1+1*2-1*4-2+1*-1 = -1\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "\n",
    "⑤ 积极和消极分开来\n",
    "\n",
    "再接下来，很明显就可以看出，这句话里面有褒有贬，不能用一个分值来表示它的情感倾向。而且这个权值的设置也会影响最终的情感分值，敏感度太高了。因此对这句话的最终的正确的处理，是得出这句话的一个积极分值，一个消极分值（这样消极分值也是正数，无需使用负数了）。它们同时代表了这句话的情感倾向。所以这句评论应该是”积极分值：6，消极分值：7“\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "\n",
    "⑥ 以分句的情感为基础\n",
    "\n",
    "再仔细一步，详细一点，一条评论的情感分值是由不同的分句加起来的，因此要得到一条评论的情感分值，就要先计算出评论中每个句子的情感分值。这条例子评论有四个分句，因此其结构如下（[积极分值, 消极分值]）：[[4, 0], [2, 0], [0, 6], [0, 1]] \n",
    "\n",
    "以上就是使用情感词典来进行情感分析的主要流程了，算法的设计也会按照这个思路来实现。\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### 算法设计\n",
    "- 第一步：读取评论数据，对评论进行分句。\n",
    "- 第二步：查找对分句的情感词，记录积极还是消极，以及位置。\n",
    "- 第三步：往情感词前查找程度词，找到就停止搜寻。为程度词设权值，乘以情感值。\n",
    "- 第四步：往情感词前查找否定词，找完全部否定词，若数量为奇数，乘以-1，若为偶数，乘以1。\n",
    "- 第五步：判断分句结尾是否有感叹号，有叹号则往前寻找情感词，有则相应的情感值+2。\n",
    "- 第六步：计算完一条评论所有分句的情感值，用数组（list）记录起来。\n",
    "- 第七步：计算并记录所有评论的情感值。\n",
    "- 第八步：通过分句计算每条评论的积极情感均值，消极情感均值，积极情感方差，消极情感方差。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-29T08:14:21.736663Z",
     "start_time": "2020-10-29T08:14:19.756842Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "import csv\n",
    "import jieba\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-29T08:14:23.322973Z",
     "start_time": "2020-10-29T08:14:23.318583Z"
    },
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "#打开词典文件，返回列表\n",
    "def open_dict(Dict, path):\n",
    "    path = path + '%s.txt' % Dict\n",
    "    dictionary = open(path, 'r', encoding='utf-8')\n",
    "    dict = []\n",
    "    for word in dictionary:\n",
    "        word = word.strip('\\n')\n",
    "        dict.append(word)\n",
    "    return dict\n",
    "\n",
    "def judgeodd(num):\n",
    "    if (num % 2) == 0:\n",
    "        return 'even'\n",
    "    else:\n",
    "        return 'odd'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-29T08:15:09.330615Z",
     "start_time": "2020-10-29T08:15:09.317915Z"
    },
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "#修改成自己的path路径。\n",
    "deny_word = open_dict(Dict = '否定词', path= r'./Textmining/')\n",
    "posdict = open_dict(Dict = 'positive', path= r'./Textmining/')\n",
    "negdict = open_dict(Dict = 'negative', path= r'./Textmining/')\n",
    "degree_word = open_dict(Dict = '程度级别词语', path= r'./Textmining/')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-29T08:15:13.381343Z",
     "start_time": "2020-10-29T08:15:13.371308Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(225, ['extreme', '百分之百', '倍加'])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(degree_word), degree_word[:3]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-29T08:15:14.887484Z",
     "start_time": "2020-10-29T08:15:14.882921Z"
    },
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "mostdict = degree_word[degree_word.index('extreme')+1 : degree_word.index('very')]#权重4，即在情感词前乘以3\n",
    "verydict = degree_word[degree_word.index('very')+1 : degree_word.index('more')]#权重3\n",
    "moredict = degree_word[degree_word.index('more')+1 : degree_word.index('ish')]#权重2\n",
    "ishdict = degree_word[degree_word.index('ish')+1 : degree_word.index('last')]#权重0.5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-29T08:15:15.999933Z",
     "start_time": "2020-10-29T08:15:15.995703Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['百分之百', '倍加', '备至', '不得了', '不堪']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mostdict[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-29T08:15:17.075090Z",
     "start_time": "2020-10-29T08:15:17.060367Z"
    },
    "code_folding": [],
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "def sentiment_score_list(dataset):\n",
    "    seg_sentence = dataset.split('。')\n",
    "    count1 = []\n",
    "    count2 = []\n",
    "    for sen in seg_sentence: #循环遍历每一个评论\n",
    "        segtmp = jieba.lcut(sen, cut_all=False)  #把句子进行分词，以列表的形式返回\n",
    "        i = 0 #记录扫描到的词的位置\n",
    "        a = 0 #记录情感词的位置\n",
    "        poscount = 0 #积极词的第一次分值\n",
    "        poscount2 = 0 #积极词反转后的分值\n",
    "        poscount3 = 0 #积极词的最后分值（包括叹号的分值）\n",
    "        negcount = 0\n",
    "        negcount2 = 0\n",
    "        negcount3 = 0\n",
    "        for word in segtmp:\n",
    "            if word in posdict:  # 判断词语是否是情感词\n",
    "                poscount += 1\n",
    "                c = 0\n",
    "                for w in segtmp[a:i]:  # 扫描情感词前的程度词\n",
    "                    if w in mostdict:\n",
    "                        poscount *= 4.0\n",
    "                    elif w in verydict:\n",
    "                        poscount *= 3.0\n",
    "                    elif w in moredict:\n",
    "                        poscount *= 2.0\n",
    "                    elif w in ishdict:\n",
    "                        poscount *= 0.5\n",
    "                    elif w in deny_word:\n",
    "                        c += 1\n",
    "                if judgeodd(c) == 'odd':  # 扫描情感词前的否定词数\n",
    "                    poscount *= -1.0\n",
    "                    poscount2 += poscount\n",
    "                    poscount = 0\n",
    "                    poscount3 = poscount + poscount2 + poscount3\n",
    "                    poscount2 = 0\n",
    "                else:\n",
    "                    poscount3 = poscount + poscount2 + poscount3\n",
    "                    poscount = 0\n",
    "                a = i + 1  # 情感词的位置变化\n",
    "\n",
    "            elif word in negdict:  # 消极情感的分析，与上面一致\n",
    "                negcount += 1\n",
    "                d = 0\n",
    "                for w in segtmp[a:i]:\n",
    "                    if w in mostdict:\n",
    "                        negcount *= 4.0\n",
    "                    elif w in verydict:\n",
    "                        negcount *= 3.0\n",
    "                    elif w in moredict:\n",
    "                        negcount *= 2.0\n",
    "                    elif w in ishdict:\n",
    "                        negcount *= 0.5\n",
    "                    elif w in degree_word:\n",
    "                        d += 1\n",
    "                if judgeodd(d) == 'odd':\n",
    "                    negcount *= -1.0\n",
    "                    negcount2 += negcount\n",
    "                    negcount = 0\n",
    "                    negcount3 = negcount + negcount2 + negcount3\n",
    "                    negcount2 = 0\n",
    "                else:\n",
    "                    negcount3 = negcount + negcount2 + negcount3\n",
    "                    negcount = 0\n",
    "                a = i + 1\n",
    "            elif word == '！' or word == '!':  ##判断句子是否有感叹号\n",
    "                for w2 in segtmp[::-1]:  # 扫描感叹号前的情感词，发现后权值+2，然后退出循环\n",
    "                    if w2 in posdict or negdict:\n",
    "                        poscount3 += 2\n",
    "                        negcount3 += 2\n",
    "                        break\n",
    "            i += 1 # 扫描词位置前移\n",
    "\n",
    "\n",
    "            # 防止出现负数的情况\n",
    "            pos_count = 0\n",
    "            neg_count = 0\n",
    "            if poscount3 < 0 and negcount3 > 0:\n",
    "                neg_count += negcount3 - poscount3\n",
    "                pos_count = 0\n",
    "            elif negcount3 < 0 and poscount3 > 0:\n",
    "                pos_count = poscount3 - negcount3\n",
    "                neg_count = 0\n",
    "            elif poscount3 < 0 and negcount3 < 0:\n",
    "                neg_count = -poscount3\n",
    "                pos_count = -negcount3\n",
    "            else:\n",
    "                pos_count = poscount3\n",
    "                neg_count = negcount3\n",
    "\n",
    "            count1.append([pos_count, neg_count])\n",
    "        count2.append(count1)\n",
    "        count1 = []\n",
    "\n",
    "    return count2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-29T08:22:06.055463Z",
     "start_time": "2020-10-29T08:22:06.051693Z"
    },
    "code_folding": [],
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "def sentiment_score(senti_score_list):\n",
    "    score = []\n",
    "    for review in senti_score_list:\n",
    "        Pos_list, Neg_list = np.array(review).T\n",
    "        Pos = np.sum(Pos_list)\n",
    "        Neg = np.sum(Neg_list)\n",
    "        score.append([Pos, Neg])\n",
    "    return score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-10-29T08:22:06.762784Z",
     "start_time": "2020-10-29T08:22:06.745477Z"
    },
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[90.0, 189.0]]\n",
      "[[498.0, 48.0]]\n"
     ]
    }
   ],
   "source": [
    "data = '今天心情不好！股票又跌了，损失惨重,和女朋友也分手了！非常生气，我非常郁闷！！！！'\n",
    "data2= '我好开心啊，非常非常非常高兴！今天我得了一百分，我很兴奋开心，愉快，开心!!!'\n",
    "\n",
    "print(sentiment_score(sentiment_score_list(data)))\n",
    "print(sentiment_score(sentiment_score_list(data2)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autoclose": false,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
